Introduction

Dynamic applications in ViEWS serve as important forecasts in their own right. Since the data are inherently serially correlated (in space, in time, and in both jointly), it makes sense to consider dynamic forecasts as a baseline.

Beyond the density baselines proposed as part of ViEWS 2.0, the forecasts proposed here are Bayesian density forecasts that allow for the evaluation of dynamics (autoregression, random walks with and without drift, and time trends) and of different distributional assumptions (e.g., Poisson, negative binomial, zero-inflated, Tweedie). The idea is that these forecasts should be baselines in the sense that no-change or density forecasts account for the basic properties of the data (beyond which covariates can then be considered via LASSO or elastic-net additions, which we leave to others).

The goal here is to take a fully Bayesian approach that admits that the conflict forecaster is uncertain about the following choices:

  1. Data in the sample / training set: this is mainly addressed by the design of the forecasting competition itself.

  2. Choice of the parametric density (e.g., Poisson, compound Poisson, negative binomial). A review of the papers from the previous event shows that this question deserves consideration, especially since the forecast baseline is a simple Poisson distribution (to be estimated on the training sample chosen above).

  3. Specification of the dynamics: autoregression, local trends, global trends, and the relative weighting of each. In the earlier analyses of the pgm and cm data, various approaches to nearness in space-time were considered to gauge how serially correlated the predicted events may be. The idea here is to benchmark and systematize how one thinks about this, at least in the time dimension.

  4. Evaluation of the forecast densities: the focus is on distributions, not on single quantiles or summary statistics (means, medians). This requires thinking about comparing relative costs via Murphy diagrams, CRPS, DRPS, etc.

As a critical consideration, being “Bayesian” in this context is not to claim that a single class or set of prior beliefs exists about the data, the forecast models, or the parameters alone. The idea is rather that models and probabilistic statements are open to interpretation and to the weighting of evidence about what one more or less believes to be true (for approaches to this see (Gill 2021; McElreath 2018)). Assessments based on in- and out-of-sample comparisons will, and ought to, weigh the relative properties of and beliefs about the performance of models and their forecast densities.

This file and the code here estimate the basic models reported and selected in the analyses. The code is not meant to be fast — it runs serially — and does not include any optimizations for running in parallel (e.g., running the same model with \(k\) different PDFs on separate cores). It is meant to be illustrative, though it is what is used for the model selection steps behind the subsequent Bayesian density forecasts.

Data setup

We begin by setting up the training data as provided by ViEWS, reading it into several data frames and some subsets.

Reading in data

Now, these files are compressed, meaning that rows exist only for observations that were actually recorded.

If you run your own version of this, you will likely need to change the paths to the input files. You can get the input files here.

# For the parquet files
library(arrow)
## 
## Attaching package: 'arrow'
## The following object is masked from 'package:utils':
## 
##     timestamp
# CM level data: features and outcomes
cmf <- read_parquet("~/ViEWS2/shared_competition_data/cm_features_to_oct2017.parquet")

#cmf17 <- read_parquet("~/ViEWS2/shared_competition_data/cm_features_to_oct2017.parquet")
cmf18 <- read_parquet("~/ViEWS2/shared_competition_data/cm_features_to_oct2018.parquet")
#cmf19 <- read_parquet("~/ViEWS2/shared_competition_data/cm_features_to_oct2019.parquet")
#cmf20 <- read_parquet("~/ViEWS2/shared_competition_data/cm_features_to_oct2020.parquet")

# CM level data: outcomes
cma <- read_parquet("~/ViEWS2/shared_competition_data/cm_actuals_2018.parquet")

#cma18 <- read_parquet("~/ViEWS2/shared_competition_data/cm_actuals_2018.parquet")
#cma19 <- read_parquet("~/ViEWS2/shared_competition_data/cm_actuals_2019.parquet")
#cma20 <- read_parquet("~/ViEWS2/shared_competition_data/cm_actuals_2020.parquet")
#cma21 <- read_parquet("~/ViEWS2/shared_competition_data/cm_actuals_2021.parquet")

countries <- read.csv("~/ViEWS2/shared_competition_data/country_list.csv")

# Save
save.image("ViEWS2-training.RData")

Modified data to deal with some missing cases

Some of the smoothing spline methods cannot handle missing data: they need to be told that observations are missing and where. To address this, the following code grabs the main identifiers from the cm data and pads the panel out appropriately.

# Start by making a sub-sample, since we want to just test code at this stage
# so this will make things faster.
x <- cmf[,45:95]    # Some features
y <- cmf$ged_sb     # Main QoI
yl <- cmf[,10:15]   # Lags of QoI

dt1 <- as.data.frame(cbind(as.factor(cmf$country_id), cmf$month_id, y, yl, x))
colnames(dt1) <- c("series", "time", "y", colnames(yl), colnames(x))

# Cleanup
rm(y,x,yl)

Now we need to check the completeness of the data / the time balance. This is necessary because the mvgam package requires balanced series in order to build the mgcv smoothing splines for the time series models of interest here.

library(tidyr)
# Data setup check step from mvgam() : below is direct from the package
all_times_avail = function(time, min_time, max_time){
  identical(as.numeric(sort(time)),
            as.numeric(seq.int(from = min_time, to = max_time)))
}

# Data defn step
data_train <- dt1
min_time <- min(data_train$time)
max_time <- max(data_train$time)

data.frame(series = data_train$series,
           time = data_train$time) %>%
  dplyr::group_by(series) %>%
  dplyr::summarise(all_there = all_times_avail(time,
                                               min_time,
                                               max_time)) -> checked_times
# if(any(checked_times$all_there == FALSE)){
#   stop('One or more series in "data" is missing observations 
#         for one or more timepoints', call. = FALSE)
# }

# Look at the results
# table(checked_times)

# Observations in the training data per country
# rowSums(table(dt1$series, dt1$time))

# These are the ones that are complete (as compared to those in checked_times)
allt <- checked_times$series[checked_times$all_there==TRUE]

So at this stage one faces a modeling choice in order to use the mvgam models:

  1. drop all of the incomplete cases (i.e., those where data are not recorded over some time periods), or

  2. pad the dataset out to a full series-by-time grid.

# Full, padded data (option 2): a series x time grid
tmp <- expand.grid(series = unique(dt1$series), 
                   time = unique(dt1$time), 
                   KEEP.OUT.ATTRS = FALSE)
# Restrict the grid to the complete series identified above
tmp <- expand.grid(series = allt, time = unique(dt1$time), 
                   KEEP.OUT.ATTRS = FALSE)

# Not adding all of the empty covariates (yet)
dt2 <- merge(tmp, dt1[,1:9], all.x = TRUE)

# Option 1 (what is actually used below): keep only the complete series,
# overwriting the padded merge above, and drop unused factor levels
dt2 <- (dt1[dt1$series %in% allt,])
dt2$series <- droplevels(dt2$series)

# Here's the check
# Note that the data are fully padded out here (same was not true above):

rowSums(table(dt2$series, dt2$time, useNA = "always"))
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   16   17 
##  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334 
##   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
##  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334 
##   34   35   36   37   38   39   40   41   42   43   45   46   47   48   49   50 
##  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334 
##   52   53   54   55   58   60   62   64   66   67   69   70   73   74   76   77 
##  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334 
##   78   79   80   81   82   85   87   89   90   93   94   96   97   99  100  101 
##  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334 
##  104  105  107  108  109  112  116  118  119  120  121  127  128  129  130  132 
##  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334 
##  133  135  136  138  139  140  142  143  145  146  147  148  149  150  151  152 
##  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334 
##  153  154  155  156  157  158  159  160  161  162  164  165  166  167  168  169 
##  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334 
##  171  172  173  174  177  178  179  180  181  182  183  198  199  205  206  213 
##  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334  334 
##  214  218  220  222  223  234  235  237  243  244 <NA> 
##  334  334  334  334  334  334  334  334  334  334    0

BAM / GAM trend models with different distributions

The estimators in this section use basic unit-specific models and allow for different dynamics and distributional forms. This section extends the above to consider the following:

  1. Poisson

  2. Zero-inflated Poisson (to better model the “excess” or common null reporting of non-events).

  3. Negative binomial models: these fit the same mean function as the Poisson model, but they add a variance specification so that the variance of the counts can be larger than the mean number of events per unit.

  4. Compound Poisson / Tweedie models. These are less well known generalizations of the above (at least among social scientists, who are probably more familiar with (King 1989)). Details are in P. K. Dunn and Smyth (2005, 2008) and P. Dunn (2017).
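
Before fitting, the dispersion differences among these candidate densities can be illustrated by simulation. A minimal sketch with assumed parameter values (the means, sizes, and zero-inflation probability below are arbitrary, not estimates from the ViEWS data):

```r
# Illustrative sketch with assumed parameter values: candidate count
# densities differ in variance and zero mass even at comparable means.
set.seed(1)
n  <- 1e5
mu <- 2
y_pois <- rpois(n, mu)                      # variance = mean
y_nb   <- rnbinom(n, size = 0.5, mu = mu)   # variance = mu + mu^2/size = 10
y_zip  <- rbinom(n, 1, 0.7) * rpois(n, mu)  # 30% structural zeros

round(c(pois = var(y_pois), nb = var(y_nb), zip = var(y_zip)), 2)
round(c(pois = mean(y_pois == 0), nb = mean(y_nb == 0),
        zip = mean(y_zip == 0)), 3)
```

The negative binomial shows the excess variance and the zero-inflated Poisson the excess zeros that motivate moving beyond the plain Poisson baseline.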

dt1 <- dt1[,1:10]  # Only grab the first 10 col
# library(mvgam)
# library(lme4)
# library(gamm4)
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.
library(glmmTMB)
## Warning in checkDepPackageVersion(dep_pkg = "TMB"): Package version inconsistency detected.
## glmmTMB was built with TMB version 1.9.4
## Current TMB version is 1.9.6
## Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)

Poisson models

Fits global and local Poisson models, as well as a tensor-product local Poisson model.

# Poisson models -- simple

psn.global <- gam(y ~ ged_sb_tlag_1 + s(time),
                     data=dt1,
           family = "poisson",
           method = "REML")

summary(psn.global)
## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## y ~ ged_sb_tlag_1 + s(time)
## 
## Parametric coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.872e+00  1.031e-03    2786   <2e-16 ***
## ged_sb_tlag_1 1.569e-04  1.175e-07    1336   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df Chi.sq p-value    
## s(time) 8.999      9 417735  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  -0.193   Deviance explained =  9.1%
## -REML = 5.0097e+06  Scale est. = 1         n = 62474
plot(psn.global, pages=1)

psn.local <- bam(y ~ ged_sb_tlag_1 + s(time, series, bs="fs"),
                data=dt1,
                family = "poisson",
                discrete=TRUE)
## Warning in bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,
## : algorithm did not converge
## Warning in bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,
## : fitted rates numerically 0 occurred
summary(psn.local)
## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## y ~ ged_sb_tlag_1 + s(time, series, bs = "fs")
## 
## Parametric coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -4.831e+03  1.022e+03  -4.729 2.26e-06 ***
## ged_sb_tlag_1 -2.871e-05  3.635e-07 -78.967  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df    Chi.sq p-value
## s(time,series) 872.9    311 3.048e+11   0.864
## 
## Rank: 2130/2132
## R-sq.(adj) =  -9.34e+05   Deviance explained = -1.57e+03%
## fREML = 9.7123e+05  Scale est. = 1         n = 62474
plot(psn.local, pages=1)

psn.local.te <- bam(y ~ ged_sb_tlag_1 + 
                      te(time, series, bs="fs"),
                 data=dt1,
                 family = "poisson",
                 discrete=TRUE)
## Warning in bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,
## : fitted rates numerically 0 occurred
summary(psn.local.te)
## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## y ~ ged_sb_tlag_1 + te(time, series, bs = "fs")
## 
## Parametric coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -6.361e+03  0.000e+00    -Inf   <2e-16 ***
## ged_sb_tlag_1 -2.163e-05  3.399e-07  -63.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df    Chi.sq p-value    
## te(time,series) 476.7    892 528711971  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 1065/1067
## R-sq.(adj) =   0.31   Deviance explained = 81.2%
## fREML = 1.0936e+06  Scale est. = 1         n = 62474

Zero Inflated Poisson models

Fits local zero-inflated Poisson and tensor zero-inflated Poisson models.

# ZIP Models -- simple

zip.local <- bam(y ~ ged_sb_tlag_1 + s(time, series, bs="fs"),
                data=dt1,
                family = ziP(),
                discrete=TRUE)
## Warning in bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,
## : algorithm did not converge
summary(zip.local)
## 
## Family: Zero inflated Poisson(-0.719,0) 
## Link function: identity 
## 
## Formula:
## y ~ ged_sb_tlag_1 + s(time, series, bs = "fs")
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.000e+00  0.000e+00     NaN      NaN    
## ged_sb_tlag_1 -3.310e-05  4.127e-07  -80.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                 edf Ref.df         F p-value    
## s(time,series) 1356   1712 2.089e+09  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 2114/2132
## Deviance explained =   75%
## fREML = 7.5093e+05  Scale est. = 1         n = 62474
plot(zip.local, pages=1)

# With tensors?

zip.local.te <- bam(y ~ ged_sb_tlag_1 + te(time, series, bs="fs"),
                   data=dt1,
                   family = ziP(),
                   discrete=TRUE)

summary(zip.local.te)
## 
## Family: Zero inflated Poisson(-0.827,0.086) 
## Link function: identity 
## 
## Formula:
## y ~ ged_sb_tlag_1 + te(time, series, bs = "fs")
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.017e+03  1.499e+03  -4.015 5.95e-05 ***
## ged_sb_tlag_1 -2.690e-05  3.825e-07 -70.324  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df        F p-value
## te(time,series) 455.6    104 1.42e+08   0.863
## 
## Deviance explained = 71.6%
## fREML = 8.4175e+05  Scale est. = 1         n = 62474

NB models

Fits local negative binomial and tensor negative binomial models.

nb.local <- bam(y ~ ged_sb_tlag_1 + s(time, series, bs="fs"),
                 data=dt1,
                 family = "nb",
                 discrete=TRUE)

summary(nb.local)
## 
## Family: Negative Binomial(0.263) 
## Link function: log 
## 
## Formula:
## y ~ ged_sb_tlag_1 + s(time, series, bs = "fs")
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.1216480  0.2764095 -18.529  < 2e-16 ***
## ged_sb_tlag_1  0.0007774  0.0001007   7.721 1.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df     F p-value    
## s(time,series) 938.5   2009 9.012  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  -6.43e+28   Deviance explained = 98.5%
## fREML =  66637  Scale est. = 1         n = 62474
plot(nb.local, pages=1)

# With tensors?

nb.local.te <- bam(y ~ ged_sb_tlag_1 + te(time, series, bs="fs"),
                    data=dt1,
                    family = "nb",
                    discrete=TRUE)

summary(nb.local.te)
## 
## Family: Negative Binomial(0.213) 
## Link function: log 
## 
## Formula:
## y ~ ged_sb_tlag_1 + te(time, series, bs = "fs")
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.9713374  0.2688804  -18.49   <2e-16 ***
## ged_sb_tlag_1  0.0011771  0.0001128   10.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df     F p-value    
## te(time,series) 679.2   1029 15.82  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  -8.21e+44   Deviance explained = 98.8%
## fREML =  66240  Scale est. = 1         n = 62474

Tweedie models

These are versions from (Adelson 1966), extended to GLMs by (Jorgensen 1987). Suppose our counts are \(Y\) (suppressing subscripts for countries and time here). For the class of EDMs (exponential dispersion models) defined by Jorgensen, the Tweedie family (Tweedie et al. 1984) has the property that for a count distribution

\(V(Y) = \phi \cdot g(E(Y))\)

where \(V(\cdot)\) is the variance, \(\phi\) is a dispersion parameter, and \(g(\cdot)\) is a function giving the mean-to-variance relationship. Here that is taken to be

\(g(E(Y)) = E(Y)^p\)

This is a power-law relationship between the mean and the variance. The standard (NB1-style) negative binomial falls out as the special case where \(\phi>1\) and \(p=1\), since it simply rescales the variance relative to the mean. But there are other special cases that come from the Tweedie / compound Poisson formulation. When \(p=0\) the model reduces in a GLM setup to a normal distribution. When \(p=1\) with \(\phi=1\), this is the assumption of the Poisson distribution for \(Y\). For \(p=2\) the process is gamma distributed, and for \(p=3\), \(Y\) is inverse Gaussian.^[These are well known in the statistics, medical, epidemiology, and ecology modeling communities, and there are easy-to-use tools for this as well.] In the regression-like approach introduced above, this adds a dispersion exponent, likely \(1 < p < 2\) (the compound Poisson-gamma range), that is of interest.
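
The mean-variance relation can be checked by simulation. A minimal sketch with assumed values of \(p\) and \(\phi\) (not estimates from the models here), using `mgcv::rTweedie`, which draws from the compound Poisson-gamma case \(1<p<2\):

```r
library(mgcv)  # provides rTweedie() for the 1 < p < 2 range

set.seed(42)
mu  <- 5    # assumed mean
p   <- 1.5  # assumed power parameter
phi <- 2    # assumed dispersion
y <- rTweedie(rep(mu, 1e5), p = p, phi = phi)

mean(y)       # approximately mu
var(y)        # approximately phi * mu^p = 2 * 5^1.5 (about 22.4)
mean(y == 0)  # positive: the compound Poisson-gamma puts mass at zero
```

The positive probability of exact zeros alongside a continuous positive part is what makes this family attractive for sparse, overdispersed event counts.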

Fits local Tweedie and tensor Tweedie models.

tw.local <- bam(y ~ ged_sb_tlag_1 + 
                  s(time, series, bs="fs"),
                data=dt1,
                family = tw(),
                discrete=TRUE)

summary(tw.local)
## 
## Family: Tweedie(p=1.581) 
## Link function: log 
## 
## Formula:
## y ~ ged_sb_tlag_1 + s(time, series, bs = "fs")
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -8.517e+00  4.383e-01 -19.430   <2e-16 ***
## ged_sb_tlag_1  1.580e-05  1.065e-05   1.484    0.138    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df     F p-value    
## s(time,series) 858.8   2011 13.47  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.269   Deviance explained = 90.1%
## fREML = 1.4014e+05  Scale est. = 10.515    n = 62474
plot(tw.local, pages=1)

# With tensors?

tw.local.te <- bam(y ~ ged_sb_tlag_1 + te(time, series, bs="fs"),
                   data=dt1,
                   family = tw(),
                   discrete=TRUE)

summary(tw.local.te)
## 
## Family: Tweedie(p=1.59) 
## Link function: log 
## 
## Formula:
## y ~ ged_sb_tlag_1 + te(time, series, bs = "fs")
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -8.105e+00  4.094e-01  -19.80  < 2e-16 ***
## ged_sb_tlag_1  5.245e-05  1.064e-05    4.93 8.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df     F p-value    
## te(time,series) 617.3   1029 24.49  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.193   Deviance explained =   88%
## fREML = 1.4396e+05  Scale est. = 11.972    n = 62474

Model selection results for in-sample fits

Here we report AIC values for the models fitted above.

AIC(psn.global)
## [1] 10019306
AIC(psn.local)
## [1] 183433165
AIC(psn.local.te)
## [1] 2102028
#AIC(nb.global)
AIC(nb.local)
## [1] 95305.3
AIC(nb.local.te)
## [1] 97434.81
AIC(zip.local)
## [1] 1379695
AIC(zip.local.te)
## [1] 1559792
AIC(tw.local)
## [1] 95147.76
AIC(tw.local.te)
## [1] 97384.88

Alternative models: GLMM specifications

This section fits the GLMM versions of the models over the candidate densities: Poisson, negative binomial, compound Poisson, and Tweedie.

# # GLMER version
# 
# psn.glmer <- glmer(y ~ (time | series), data=dt1, family="poisson")
# 
# #nb.glmer <- glmer.nb(y ~ (time | series), data=dt1)
# 
# # GAMM / GAMM 4 -- these need work since the groups are not right.
# 
# psn.gamm <- gamm4(y ~ ged_sb_tlag_1 + s(time),
#                   random=~(time|series), 
#                   data=dt1, 
#                   family = "poisson")
# 
# summary(psn.gamm$gam)
# summary(psn.gamm$mer)
# 
# plot(psn.gamm$gam)
# plot(psn.gamm$mer)
# 
# system.time(psn1.gamm <- gamm(y ~ ged_sb_tlag_1 + s(time),
#                   random=list(series=~time), 
#                   data=dt1, 
#                   family = "poisson"))
# 
# system.time(nb.gamm <- gamm(y ~ ged_sb_tlag_1 + s(time),
#                               random=list(series=~1), 
#                               data=dt1, 
#                               family = "nb", niterPQL = 100))
# 
# plot(nb.gamm$gam, pages=1)
# plot(nb.gamm$lme, pages=1)
# 
# summary(nb.gamm$gam)
# summary(nb.gamm$lme)
# 
# 
# system.time(tw.gamm <- gamm(y ~ ged_sb_tlag_1 + s(time),
#                             random=list(series=~1), 
#                             data=dt1, 
#                             family = "tw", niterPQL = 100))
# 
# plot(tw.gamm$gam, pages=1)
# plot(tw.gamm$lme, pages=1)
# 
# summary(tw.gamm$gam)
# summary(tw.gamm$lme)
# 
# system.time(psn.ar1.gamm <- gamm(y ~ s(time),
#                               correlation = corAR1(form=~-1|series),
#                               data=dt1,
#                               family = "poisson", niterPQL = 100))
# summary(psn.ar1.gamm$gam)
# summary(psn.ar1.gamm$lme)
# 
# plot(psn.ar1.gamm$gam, pages=1)
# plot(psn.ar1.gamm$lme, pages=1)
# 

# glmmTMB versions

# Need time as a factor for the next command...
dt1$time <- factor(dt1$time)

# This one will crash...
# options(glmmTMB.cores=4)
# psn.glmmtmb <- glmmTMB(y ~ (time|series), family = poisson,
#                        data=dt1)

psnar1.glmmtmb <- glmmTMB(y ~ ar1(time + 0|series), family = poisson,
                       data=dt1, control=glmmTMBControl(parallel = 4))

nbar1.glmmtmb <- glmmTMB(y ~ ar1(time + 0|series), family = nbinom1(),
                          data=dt1)

cmpar1.glmmtmb <- glmmTMB(y ~ ar1(time + 0|series), family = compois(),
                         data=dt1)

twar1.glmmtmb <- glmmTMB(y ~ ar1(time + 0|series), family = tweedie(),
                         data=dt1)

Model selection results for GLMM models

# Make a table of in-sample fits
library(bbmle)
## Loading required package: stats4
AICtab(psnar1.glmmtmb, nbar1.glmmtmb, cmpar1.glmmtmb, twar1.glmmtmb, 
       delta =FALSE, base=TRUE)
##                AIC     df
## twar1.glmmtmb  91328.1 5 
## nbar1.glmmtmb  91383.6 4 
## cmpar1.glmmtmb 92489.4 4 
## psnar1.glmmtmb 93946.6 3
AICctab(psnar1.glmmtmb, nbar1.glmmtmb, cmpar1.glmmtmb, twar1.glmmtmb)
##                dAICc  df
## twar1.glmmtmb     0.0 5 
## nbar1.glmmtmb    55.5 4 
## cmpar1.glmmtmb 1161.3 4 
## psnar1.glmmtmb 2618.5 3
BICtab(psnar1.glmmtmb, nbar1.glmmtmb, cmpar1.glmmtmb, twar1.glmmtmb)
##                dBIC   df
## twar1.glmmtmb     0.0 5 
## nbar1.glmmtmb    46.5 4 
## cmpar1.glmmtmb 1152.2 4 
## psnar1.glmmtmb 2600.4 3
summary(twar1.glmmtmb)
##  Family: tweedie  ( log )
## Formula:          y ~ ar1(time + 0 | series)
## Data: dt1
## 
##      AIC      BIC   logLik deviance df.resid 
##  91328.1  91373.3 -45659.0  91318.1    62469 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name    Variance Std.Dev. Corr      
##  series time121 125.4    11.2     0.99 (ar1)
## Number of obs: 62474, groups:  series, 213
## 
## Dispersion parameter for tweedie family (): 5.13 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -14.1305     0.6384  -22.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(nbar1.glmmtmb)
##  Family: nbinom1  ( log )
## Formula:          y ~ ar1(time + 0 | series)
## Data: dt1
## 
##      AIC      BIC   logLik deviance df.resid 
##  91383.6  91419.8 -45687.8  91375.6    62470 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name    Variance Std.Dev. Corr      
##  series time121 54.24    7.365    0.99 (ar1)
## Number of obs: 62474, groups:  series, 213
## 
## Dispersion parameter for nbinom1 family ():   34 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.1276     0.3806  -21.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Among the Poisson models
AICtab(psn.global, psn.local, psn.local.te)
##              dAIC        df   
## psn.local.te         0.0 479.2
## psn.global     7917277.8 11   
## psn.local    181331136.9 876
AICtab(nb.local, nb.local.te)
##             dAIC   df   
## nb.local       0.0 943.4
## nb.local.te 2129.5 683.9

Basic model selection begins with a GAMM that includes a simple (time-lagged) term and deterministic effects for a global smoothing trend and local trends. These are called the local models in what follows. The local models are estimated for each of the distributions outlined above. To compare in-sample fit, the AIC statistics are ranked as below:

# Compare distributions
# AICtab(psn.local, zip.local, nb.local, tw.local)
# AICctab(psn.local, zip.local, nb.local, tw.local)
ICtab(psn.local, zip.local, nb.local, tw.local, 
      base=TRUE, delta=FALSE)
##           AIC         df    
## tw.local      95147.8 863.9 
## nb.local      95305.3 943.4 
## zip.local   1379695.5 1360.9
## psn.local 183433165.3 876

Clearly the models that admit some dispersion in the event counts are strongly preferred.^[The same is true using BIC, or any other related information criterion.]
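
How strong is "strongly preferred"? One way to quantify it is to convert the reported AIC values into Akaike weights; with deltas this large, the Tweedie model carries essentially all of the weight:

```r
# AIC values for the local models, as reported above
aic   <- c(tw = 95147.8, nb = 95305.3, zip = 1379695.5, psn = 183433165.3)
delta <- aic - min(aic)
w <- exp(-delta / 2) / sum(exp(-delta / 2))  # Akaike weights, sum to 1
round(w, 4)
```

Even the negative binomial, only about 158 AIC points behind, receives a numerically negligible weight.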

One might then ask how much better or worse these models do as the dynamic specification for the trends changes. This is seen by looking at tensor products of the GAM splines for the global and local trends (a temporally sparser model with better degrees of freedom). Even in this case, things are no better, and in fact worse, than the specifications discussed above:

# Compare distributions
AICtab(psn.local.te, zip.local.te, nb.local.te, tw.local.te)
##              dAIC      df   
## tw.local.te        0.0 622.1
## nb.local.te       49.9 683.9
## zip.local.te 1462407.6 462.9
## psn.local.te 2004643.5 479.2
AICctab(psn.local.te, zip.local.te, nb.local.te, tw.local.te)
##              dAICc     df   
## tw.local.te        0.0 622.1
## nb.local.te       52.6 683.9
## zip.local.te 1462401.9 462.9
## psn.local.te 2004638.4 479.2
ICtab(psn.local.te, zip.local.te, nb.local.te, tw.local.te, 
      base=TRUE, delta=FALSE)
##              AIC       df   
## tw.local.te    97384.9 622.1
## nb.local.te    97434.8 683.9
## zip.local.te 1559792.4 462.9
## psn.local.te 2102028.4 479.2

If one changes the dynamic specification and moves the trends into the latent space (as is done in the GAMM- and GLMM-type models), the in-sample fits improve further:

AIC(twar1.glmmtmb)
## [1] 91328.07
AIC(nbar1.glmmtmb)
## [1] 91383.6
AIC(cmpar1.glmmtmb)
## [1] 92489.35
AIC(psnar1.glmmtmb)
## [1] 93946.59

(Note: not all of the GAMM and GLMM model variants available in R give consistent reporting or computation of the AIC or likelihood-based fit measures.)

Generate Predictions and Comparisons for 2018

The code below generates plots of the models' one-step predictions.

# Plot local trends for each modeling specification
# 
# pdf(file="gam-local-in-sample.pdf")
par(mfcol=c(2,2))
plot(psn.local, ylab = "Poisson trends")
abline(h=0, col="red", lwd=2)
plot(zip.local, ylab = "ZiP trends")
abline(h=0, col="red", lwd=2)
plot(nb.local, ylab = "Negative binomial trends")
abline(h=0, col="red", lwd=2)
plot(tw.local, ylab = "Tweedie trends")

abline(h=0, col="red", lwd=2)

# dev.off()

# pdf(file="gam-tensor-in-sample.pdf")
par(mfcol=c(2,2))
vis.gam(psn.local.te, ylab = "Poisson tensor trends")
vis.gam(zip.local.te, ylab = "ZiP tensor trends")
vis.gam(nb.local.te, ylab = "Negative binomial tensor trends")
vis.gam(tw.local.te, ylab = "Tweedie tensor trends")

# dev.off()

# Predictions in-sample
local.preds <- cbind(dt1[,1:2],
                     predict(psn.local, type = "response"),
                     predict(nb.local, type = "response"),
#                     predict(zip.local, type = "response"),
                     predict(tw.local, type = "response"))
colnames(local.preds) <- c("series", "time", "P", "NB", "TW")

# Tensor preds
tensor.preds <- cbind(dt1[,1:2],
                      predict(psn.local.te, type = "response"),
                      predict(nb.local.te, type = "response"),
                      predict(tw.local.te, type = "response"))
colnames(tensor.preds) <- c("series", "time", "P", "NB", "TW")

# GLMM preds
glmm.preds <- cbind(dt1[,1:2],
                    predict(psnar1.glmmtmb, type = "response"),
                    predict(nbar1.glmmtmb, type = "response"),
#                    predict(cmpar1.glmmtmb, type = "response"),
                    predict(twar1.glmmtmb, type = "response"))
colnames(glmm.preds) <- c("series", "time", "P", "NB", "TW")

# # GAMM Preds
# gamm.preds <- cbind(dt1[,1:2],
#                     exp(fitted(psn1.gamm$lme)),
#                     exp(fitted(psn.ar1.gamm$lme)),
#                     exp(fitted(nb.gamm$lme)),
#                     exp(fitted(tw.gamm$lme)))
# colnames(gamm.preds) <- c("series", "time", "P", "P-AR", "NB", "TW")

CRPS results

Here we show how to compute the CRPS for a single time period for each forecast model.

# Get the last observed period (time 454) for an "in-sample" comparison
lastobs <- dt1[dt1$time==454,]
local.last <- local.preds[local.preds$time==454,]
tensor.last <- tensor.preds[tensor.preds$time==454,]
glmm.last <- glmm.preds[glmm.preds$time==454,]
#gamm.last <- gamm.preds[gamm.preds$time==454,]

# Make draws from the relevant model predictions

# Local preds in-sample pdf
N <- 1000;    # Number of draws
n <- dim(lastobs)[1]
k <-  1;      #  number of forecasts

set.seed(1234)
local.P.fc <- sapply(1:n, function(i) {rpois(N, local.last$P[i])})
theta <- exp(nb.local$family$getTheta())
local.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=theta, mu=local.last$NB[i])})
local.TW.fc <- t(replicate(N, rTweedie(local.last$TW, p=1.581)))

# Tensor preds in-sample pdf
tensor.P.fc <- sapply(1:n, function(i) {rpois(N, tensor.last$P[i])})
theta <- exp(nb.local.te$family$getTheta())
tensor.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=theta, mu=tensor.last$NB[i])})
tensor.TW.fc <- t(replicate(N, rTweedie(tensor.last$TW,
                                        p=tw.local.te$family$getTheta(TRUE))))

# glmm preds in-sample pdf
glmm.P.fc <- sapply(1:n, function(i) {rpois(N, glmm.last$P[i])})
# NB size (theta) and Tweedie power p are hard-coded here from the fitted
# glmmTMB dispersion estimates (cf. sigma(nbar1.glmmtmb))
glmm.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=34, mu=glmm.last$NB[i])})
glmm.TW.fc <- t(replicate(N, rTweedie(glmm.last$TW, p=1.36)))

# gamm preds in-sample pdf
# gamm.P.fc <- sapply(1:n, function(i) {rpois(N, gamm.last$P[i])})
# gamm.PAR.fc <- sapply(1:n, function(i) {rpois(N, gamm.last$`P-AR`[i])})
# gamm.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=34, mu=gamm.last$NB[i])})
# gamm.TW.fc <- t(replicate(N, rTweedie(gamm.last$TW, p=(tw.gamm$gam$family$getTheta(TRUE)))))

# crps

library(scoringutils)
cat("Poisson, local:", sum(crps_sample(lastobs$y, t(local.P.fc))), "\n")
## Poisson, local: 1778.035
cat("NB, local     :", sum(crps_sample(lastobs$y, t(local.NB.fc))), "\n")
## NB, local     : 3458.133
cat("Tweedie, local:", sum(crps_sample(lastobs$y, t(local.TW.fc))), "\n")
## Tweedie, local: 1593.813
cat("Poisson, tensor:", sum(crps_sample(lastobs$y, t(tensor.P.fc))), "\n")
## Poisson, tensor: 2463.162
cat("NB, tensor     :",sum(crps_sample(lastobs$y, t(tensor.NB.fc))), "\n")
## NB, tensor     : 3910.693
cat("Tweedie, tensor:",sum(crps_sample(lastobs$y, t(tensor.TW.fc))), "\n")
## Tweedie, tensor: 3276.774
cat("Poisson, glmm  :",sum(crps_sample(lastobs$y, t(glmm.P.fc))), "\n")
## Poisson, glmm  : 63.32284
cat("NB, glmm       :",sum(crps_sample(lastobs$y, t(glmm.NB.fc))), "\n")
## NB, glmm       : 460.1975
cat("Tweedie, glmm  :",sum(crps_sample(lastobs$y, t(glmm.TW.fc))), "\n")
## Tweedie, glmm  : 243.5304
# cat("Poisson, gamm  :",sum(crps_sample(lastobs$y, t(gamm.P.fc))), "\n")
# cat("PoissonAR, gamm:",sum(crps_sample(lastobs$y, t(gamm.PAR.fc))), "\n")
# cat("NB, gamm       :",sum(crps_sample(lastobs$y, t(gamm.NB.fc))), "\n")
# cat("Tweedie, gamm  :",sum(crps_sample(lastobs$y, t(gamm.TW.fc))), "\n")

Now predict all of 2018 using the data through October 2017 (14 months)

We do not pay the cost of re-estimation here; instead we simply roll the fitted models forward, month by month, through the end of 2018. This yields the 2018 forecasts based only on data through October 2017 (14 monthly horizons).
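The mechanics of the rolling loops below can be sketched generically: each horizon plugs the mean of the previous horizon's simulated draws back in as the lagged covariate, then simulates a fresh predictive sample. A toy base-R version, where the log-linear conditional mean `lambda_fun` is made up and stands in for `predict(<model>, type = "response")`:

```r
set.seed(42)
N <- 1000                     # draws per horizon
h <- 14                       # horizons: November 2017 through December 2018
# Hypothetical conditional mean standing in for a fitted model's predict()
lambda_fun <- function(lag) exp(0.1 + 0.8 * log1p(lag))
fc <- vector(mode = "list", length = h)
lag_val <- 3                  # last observed count (time 454)
for (i in 1:h) {
  draws <- rpois(N, lambda_fun(lag_val))   # predictive sample at horizon i
  lag_val <- mean(draws)                   # plug-in lag for horizon i + 1
  fc[[i]] <- draws
}
round(sapply(fc, mean), 2)    # point forecasts by horizon
```

Feeding the predictive mean forward understates forecast uncertainty relative to propagating each draw separately; that is a deliberate simplification shared with the loops below.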

local.P.2018 <- local.NB.2018 <- local.TW.2018 <- vector(mode = "list", length=14)

k1 <- 454

for(i in 1:14)
{
  # Local models
  old.p <- apply(local.P.fc, 2, mean)
  tmp.p <- predict(psn.local, data.frame(series=local.last[,1], 
                                time = rep((k1+i), 191), 
                                ged_sb_tlag_1 = old.p),
            type = "response")
  local.P.fc <- sapply(1:n, function(i) {rpois(N, tmp.p[i])})
  
  old.nb <- apply(local.NB.fc, 2, mean)
  tmp.nb <- predict(nb.local, data.frame(series=local.last[,1], 
                                time = rep((k1+i), 191), 
                                ged_sb_tlag_1 = old.nb),
            type = "response")
  theta <- exp(nb.local$family$getTheta())
  local.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=theta, mu=tmp.nb[i])})

  old.tw <- apply(local.TW.fc, 2, mean)
  tmp.tw <- predict(tw.local, data.frame(series=local.last[,1], 
                                time = rep((k1+i), 191), 
                                ged_sb_tlag_1 = old.tw),
            type = "response")
  local.TW.fc <- t(replicate(N, rTweedie(tmp.tw, p=1.581)))

  local.P.2018[[i]] <- local.P.fc
  local.NB.2018[[i]] <- local.NB.fc
  local.TW.2018[[i]] <- local.TW.fc
}


# Now the same for the other models...

tensor.P.2018 <- tensor.NB.2018 <- tensor.TW.2018 <- vector(mode = "list", length=14)

k1 <- 454

for(i in 1:14)
{
  # tensor models
  old.p <- apply(tensor.P.fc, 2, mean)
  tmp.p <- predict(psn.local.te, data.frame(series=tensor.last[,1], 
                                time = rep((k1+i), 191), 
                                ged_sb_tlag_1 = old.p),
            type = "response")
  tensor.P.fc <- sapply(1:n, function(i) {rpois(N, tmp.p[i])})
  
  old.nb <- apply(tensor.NB.fc, 2, mean)
  tmp.nb <- predict(nb.local.te, data.frame(series=tensor.last[,1], 
                                time = rep((k1+i), 191), 
                                ged_sb_tlag_1 = old.nb),
            type = "response")
  theta <- exp(nb.local.te$family$getTheta())
  tensor.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=theta, mu=tmp.nb[i])})

  old.tw <- apply(tensor.TW.fc, 2, mean)
  tmp.tw <- predict(tw.local.te, data.frame(series=tensor.last[,1], 
                                time = rep((k1+i), 191), 
                                ged_sb_tlag_1 = old.tw),
            type = "response")
  # use the tensor model's own estimated power, as in the in-sample draws above
  tensor.TW.fc <- t(replicate(N, rTweedie(tmp.tw, p=tw.local.te$family$getTheta(TRUE))))

  tensor.P.2018[[i]] <- tensor.P.fc
  tensor.NB.2018[[i]] <- tensor.NB.fc
  tensor.TW.2018[[i]] <- tensor.TW.fc
}


# glmm -- extrapolation is awkward here: time enters as a factor, so future
# periods are unseen levels; allow.new.levels=TRUE sets their random effects to zero
glmm.P.2018 <- glmm.NB.2018 <- glmm.TW.2018 <- vector(mode = "list", length=14)

k1 <- 454

for(i in 1:14)
{
  # glmm models
  old.p <- apply(glmm.P.fc, 2, mean)
  tmp.p <- predict(psnar1.glmmtmb, newdata=data.frame(series=glmm.last[,1],
                                time = as.factor(rep((k1+i), 191)),
                                ged_sb_tlag_1 = old.p),
            type = "response", allow.new.levels=TRUE)
  glmm.P.fc <- sapply(1:n, function(i) {rpois(N, tmp.p[i])})

  old.nb <- apply(glmm.NB.fc, 2, mean)
  tmp.nb <- predict(nbar1.glmmtmb, data.frame(series=glmm.last[,1],
                                time = as.factor(rep((k1+i), 191)),
                                ged_sb_tlag_1 = old.nb),
            type = "response", allow.new.levels=TRUE)
#  theta <- exp(nbar1.glmmtmb$family$getTheta())
  glmm.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=34, mu=tmp.nb[i])})

  old.tw <- apply(glmm.TW.fc, 2, mean)
  tmp.tw <- predict(twar1.glmmtmb, data.frame(series=glmm.last[,1],
                                time = as.factor(rep((k1+i), 191)),
                                ged_sb_tlag_1 = old.tw),
            type = "response", allow.new.levels=TRUE)
  glmm.TW.fc <- t(replicate(N, rTweedie(tmp.tw, p=1.36)))  # power as in the in-sample draws

  glmm.P.2018[[i]] <- glmm.P.fc
  glmm.NB.2018[[i]] <- glmm.NB.fc
  glmm.TW.2018[[i]] <- glmm.TW.fc
}


# Not going to do the GAMM here, since the CRPS are so bad
# # gamm
# gamm.P.2018 <- gamm.NB.2018 <- gamm.TW.2018 <- vector(mode = "list", length=14)
# 
# k1 <- 454
# 
# for(i in 1:14)
# {
#   # gamm models
#   old.p <- apply(gamm.P.fc, 2, mean)
#   tmp.p <- predict(psn1.gamm, data.frame(series=gamm.last[,1],
#                                 time = rep((k1+i), 191),
#                                 ged_sb_tlag_1 = old.p),
#             type = "response")
#   gamm.P.fc <- sapply(1:n, function(i) {rpois(N, tmp.p[i])})
# 
# #   old.nb <- apply(gamm.NB.fc, 2, mean)
# #   tmp.nb <- predict(nb.local.te, data.frame(series=gamm.last[,1], 
# #                                 time = rep((k1+i), 191), 
# #                                 ged_sb_tlag_1 = old.nb),
# #             type = "response")
# #   theta <- exp(nb.local.te$family$getTheta())
# #   gamm.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=theta, mu=tmp.nb[i])})
# # 
# #   old.tw <- apply(gamm.TW.fc, 2, mean)
# #   tmp.tw <- predict(tw.local.te, data.frame(series=gamm.last[,1], 
# #                                 time = rep((k1+i), 191), 
# #                                 ged_sb_tlag_1 = old.tw),
# #             type = "response")
# #   gamm.TW.fc <- t(replicate(N, rTweedie(tmp.tw, p=1.581)))
# # 
#   gamm.P.2018[[i]] <- gamm.P.fc
# #   gamm.NB.2018[[i]] <- gamm.NB.fc
# #   gamm.TW.2018[[i]] <- gamm.TW.fc
# }

Save the outputs for additional post-processing and analysis

Here we save the workspace image with the results of the model estimation and model selection, for further post-processing, forecast output, and comparisons in other files.

save.image("ViEWS2-prelim.RData")

References

Adelson, R. M. 1966. “Compound Poisson Distributions.” Journal of the Operational Research Society 17 (1): 73–75. https://doi.org/10.1057/jors.1966.8.
Dunn, Peter K., and Gordon K. Smyth. 2005. “Series Evaluation of Tweedie Exponential Dispersion Model Densities.” Statistics and Computing 15: 267–80.
Dunn, Peter K. 2017. “Tweedie: Evaluation of Tweedie Exponential Family Models.” R package version 2.3.
Gill, Jeff. 2021. “Political Science Is a Data Science.” The Journal of Politics 83 (1): 1–7.
Jorgensen, Bent. 1987. “Exponential Dispersion Models.” Journal of the Royal Statistical Society. Series B (Methodological) 49 (2): 127–62. http://www.jstor.org/stable/2345415.
King, Gary. 1989. “Variance Specification in Event Count Models: From Restrictive Assumptions to a Generalized Estimator.” American Journal of Political Science, 762–84.
McElreath, Richard. 2018. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman & Hall/CRC.
Tweedie, Maurice C. K., et al. 1984. “An Index Which Distinguishes Between Some Important Exponential Families.” In Statistics: Applications and New Directions: Proc. Indian Statistical Institute Golden Jubilee International Conference, 579:579–604.